How do we become confident in the safety of a machine learning system?

Thanks to Rohin Shah, Ajeya Cotra, Richard Ngo, Paul Christiano, Jon Uesato, Kate Woolverton, Beth Barnes, and William Saunders for helpful comments and feedback.

Evaluating proposals for building safe advanced AI—and actually building any degree of confidence in their safety or lack thereof—is extremely difficult. Previously, in “An overview of 11 proposals for building safe advanced AI,” I tried evaluating such proposals on the axes of outer alignment, inner alignment, training competitiveness, and performance competitiveness. While I think that those criteria were good for posing open questions, they didn’t lend themselves well to actually helping us understand what assumptions needed to hold for any particular proposal to work. Furthermore, if you’ve read that paper/​post, you’ll notice that those evaluation criteria don’t even work for some of the proposals on that list, most notably Microscope AI and STEM AI, which aren’t trying to be outer aligned and don’t really have a coherent notion of inner alignment either.

Thus, I think we need a better alternative for evaluating such proposals—and actually helping us figure out what needs to be true for us to be confident in them—and I want to try to offer it in the form of training stories. My hope is that training stories will provide:

  • a general framework through which we can evaluate any proposal for building safe advanced AI,

  • a concise description of exactly what needs to be true for any particular proposal to succeed—and thus what we need to know to be confident in it—and

  • a well-defined picture of the full space of possible proposals, helping us think more broadly regarding new approaches to AI safety, unconstrained by an evaluation framework that implicitly rules out certain approaches.

What’s a training story?

When you train a neural network, you don’t have direct control over what algorithm that network ends up implementing. You do get to incentivize it to have some particular behavior over the training data, so you might say “whatever algorithm it’s implementing, it has to be one that’s good at predicting webtext”—but that doesn’t tell you how your model is going to go about accomplishing that task. But exactly how your model learns to accomplish the task that you give it matters quite a lot, since that’s what determines how your model is going to generalize to new data—which is precisely where most of the safety concerns are. A training story is a story of how you think training is going to go and what sort of model you think you’re going to get at the end, as a way of explaining how you’re planning on dealing with that very fundamental question of how your model is going to learn to accomplish the task that you give it.
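To make this concrete, here's a minimal sketch of a standard supervised training setup (illustrative PyTorch-style code, not drawn from any particular real project). Notice that everything we directly control is visible in the code: the architecture's capacity, the training data, and the loss on that data. Nothing in it specifies which algorithm the trained network will end up implementing.

```python
import torch
import torch.nn as nn

# A small image classifier (e.g. cat vs. non-cat). The architecture bounds
# what algorithms the network *can* implement, but doesn't pick one out.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 2),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train(loader, epochs=10):
    """All we specify is behavior over the training data: make the loss low."""
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()

# Whether the resulting weights classify via human-like shape heuristics,
# texture statistics, or something else is left to the training process;
# that is exactly the question a training story is meant to answer.
```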

Let’s consider cat classification as an example. Right now, if you asked a machine learning researcher what their goal is in training a cat classifier, they’d probably say something like “we want to train a model that distinguishes cats from non-cats.” The problem with that sort of a training story, however, is that it only describes the desired behavior for the model to have, not the desired mechanism for how the model might achieve that behavior. Instead of such “behavioral training stories,” for the rest of the post when I say “training story,” I want to specifically reference mechanistic training stories—stories of how training goes in terms of what sort of algorithm the model you get at the end is implementing, not just behaviorally what your model does on the training distribution. For example, a mechanistic training story for cat classification might look like:

“We want to get a model that’s composed of a bunch of heuristics for detecting cats in images that correspond to the same sorts of heuristics that humans use for cat detection. If we get such a model, we don’t think it’ll be dangerous in any way because we think that human cat detection heuristics alone are insufficient for any sort of dangerous agentic planning, which we think would be necessary for such a model to pose a risk.

Our plan to get such a model is to train a deep convolutional neural network on images of cats and non-cats. We believe that the simplest model that correctly labels a large collection of cat and non-cat images will be one that implements human-like heuristics for cat detection, as we believe that human cat detection heuristics are highly simple and natural for the task of distinguishing cats from non-cats.”

I think that there are a bunch of things that are nice about the above story. First, if the above story is true, it’s sufficient for safety—it precisely describes a story for how training is supposed to go such that the resulting model is safe. Furthermore, such a story makes pretty explicit what could go wrong such that the resulting model wouldn’t be safe—in this case, if the simplest cat-detecting neural network was an agent or an optimization process that terminally valued distinguishing cats from non-cats. I think that explicitly stating what assumptions are being made about what model you’re going to get is important, since at some point you could get an agent/​optimizer rather than just a bunch of heuristics.[1]

Second, such a story is highly falsifiable—in fact, as we now know from work like Ilyas et al.’s “Adversarial Examples Are Not Bugs, They Are Features,” the sorts of cat-detection heuristics that neural networks generally learn are often not very human-like. Of course, I picked this story explicitly because it made plausible claims that we can now actually falsify. Though every training story should have to make falsifiable claims about what mechanistically the model should be doing, those claims could be quite difficult in general to falsify, as our ability to understand anything about what our models are doing mechanistically is quite limited. While this might seem like a failure of training stories, in some sense I think it’s also a strength, as it explicitly makes clear the importance of better tools for analyzing/​falsifying facts about what our models are doing.

Third, training stories like the above can be formulated for essentially any situation where you’re trying to train a model to accomplish a task—not only are training stories useful for complex alignment proposals, as we’ll see later, but they also apply even to simple cat detection, as in the story above. In fact, though it’s what I primarily want them for, I don’t think that there’s any reason that training stories need to be exclusively for large/​advanced/​general/​transformative AI projects. In my opinion, any AI project that has cause to be concerned about risks/​dangers should have a training story. Furthermore, since I think it will likely get difficult to tell in the future whether there should be such cause for concern, I think that the world would be a much better place if every AI project—e.g. every NeurIPS paper—said what their training story was.

Training story components

To help facilitate the creation of good training stories, I’m going to propose that every training story at least have the following basic parts:

  1. The training goal: what sort of algorithm you’re hoping your model will learn and why learning that sort of algorithm will be good. This should be a mechanistic description of the desired model that explains how you want it to work—e.g. “classify cats using human vision heuristics”—not just what you want it to do—e.g. “classify cats.”

  2. The training rationale: why you believe that your training setup will cause your model to learn that sort of algorithm. “Training setup,” here, refers to anything done before the model is released, deployed, or otherwise given the ability to meaningfully impact the world. Importantly, note that a training rationale is _not_ a description of what, concretely, will be done to train the model—e.g. “using RL”—but rather a rationale for why you think the various techniques employed will produce the desired training goal—e.g. “we think this RL setup will cause this sort of model to be produced for these reasons.”

Note that there is some tension in the above notion of a training goal, which is that, if you have to know from a mechanistic/​algorithm perspective exactly what you want your model to be doing, then what’s the point of using machine learning if you could just implement that algorithm yourself? The answer to this tension is that the training goal doesn’t need to be quite that precise—but exactly how precise it should be is a tricky question that I’ll go into in more detail in the next section.

For now, within the above two basic parts, I want to break each down into two pieces, giving us the full four components that I think any training story needs to have:

  1. Training goal specification: as complete a specification as possible of exactly what sort of algorithm you’re intending your model to learn. Importantly, the training goal specification should be about the desired sort of algorithm you want your model to be implementing internally, not just the desired behavior that you want your model to have. In other words, the training goal specification should be a mechanistic description of the desired model rather than a behavioral description. Obviously, as I noted previously, the training goal specification doesn’t have to be a full mechanistic description—but it needs to say enough to ensure that any model that meets it is desirable, as in the next component.

  2. Training goal desirability: a description of why learning that sort of algorithm is desirable, both in terms of not causing safety problems and accomplishing the desired goal. Training goal desirability should include why learning any algorithm that meets the training goal specification would be good, rather than just a description of a specific good model that conforms to the training goal.

  3. Training rationale constraints: what constraints you know must hold for the model and why the training goal is consistent with those constraints. For example: the fact that a model trained to zero loss must fit the training data perfectly would be a training rationale constraint, as would be the fact that whatever algorithm the model ends up implementing has to be possible to implement with the given architecture.

  4. Training rationale nudges: why, among all the different sorts of algorithms that are consistent with the training rationale constraints, you think that the training process will end up producing a model that conforms to the desired training goal. This would include arguments like “we think this is the simplest model that fits the data” as in the cat detection training story.

As an example of applying these components, let’s reformulate the cat detection training story using these four basic components:

  1. Training goal specification: The goal is to get a model that’s composed of a bunch of heuristics for detecting cats in images that correspond to the same sorts of heuristics used by humans for cat detection.

  2. Training goal desirability: Such a model shouldn’t be dangerous in any way because we think that human cat detection heuristics alone are insufficient for any sort of dangerous agentic planning, which we think would be necessary for such a model to pose a risk. Furthermore, we think that human cat detection heuristics must be sufficient for cat detection, as we know that humans are capable of detecting cats.

  3. Training rationale constraints: Whatever model we get must be one that correctly classifies cats from non-cats over our training data and is implementable on a deep convolutional neural network. We think that the training goal satisfies these constraints since we think human heuristics are simple enough to be implemented by a CNN and correct enough to classify the training data.

  4. Training rationale nudges: We believe that the simplest model that correctly labels a large collection of cat and non-cat images will be the desired model that implements human-like heuristics for cat detection, as we believe that human cat detection heuristics are highly simple and natural for the task of distinguishing cats from non-cats.
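To see the constraints/nudges distinction in an even more stripped-down setting (a toy illustration, not about neural networks or the cat example specifically), note that the constraint of fitting the training data is typically satisfied by many different models that behave identically in training but implement very different algorithms; it's the nudges, such as a simplicity bias, that are supposed to select among them:

```python
import numpy as np

# Training rationale constraint: fit these four labeled points exactly.
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = 2.0 * x_train + 1.0  # the data happens to come from a simple line

def simple_model(x):
    # The model a simplicity nudge is hoped to select.
    return 2.0 * x + 1.0

def wiggly_model(x):
    # Also satisfies the constraint: the added term vanishes on every
    # training point, so it agrees with simple_model there exactly.
    return simple_model(x) + 5.0 * x * (x - 1.0) * (x - 2.0) * (x - 3.0)

assert np.allclose(simple_model(x_train), y_train)
assert np.allclose(wiggly_model(x_train), y_train)

x_test = np.array([4.0, 5.0])
print(simple_model(x_test))  # -> [  9.  11.]  (generalizes as intended)
print(wiggly_model(x_test))  # -> [129. 611.]  (same training behavior, different algorithm)
```

The analogue of the nudge here would be something like a preference for low-degree polynomials; in the neural network case, it's the much less well-understood inductive biases of the architecture and training process.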

How mechanistic does a training goal need to be?

One potential difficulty in formulating training goals as described above is determining what to specify and what to leave unspecified in the training goal specification. Specify too little and your training goal specification won’t be constraining enough to ensure that any model that meets it is desirable—but specify too much, and why are you even using machine learning in the first place if you already know precisely what algorithm you want the resulting model to implement?

In practice, I think it’s always a good idea to be as precise as you can—so the real question is, how precise do you need to be for a description to work well as a training goal specification? Fundamentally, there are two constraining factors: the first is training goal desirability—the more precise your training goal, the easier it is to argue that any model that meets it is desirable—and the second is the training rationale—how hard it will actually be in practice to ensure that you get that specific training goal.

Though it might seem like these two factors are pushing in opposite directions—training goal desirability towards a more precise goal and the difficulty of formulating a training rationale towards a more general goal—I think that’s actually not true. Formulating a good training rationale can often be much easier for a more precise training goal. For example, if your training goal is “a safe model,” that’s a very broad goal, but an extremely difficult one to ensure that you actually achieve. In fact, I would argue, creating a training rationale for the training goal of “a safe model” is likely to require putting an entire additional training story in your training rationale, as you’ve effectively gone down a level without actually reducing the original problem at all. The factors that, in my opinion, actually make a training goal specification easier to build a training rationale for aren’t generality, but rather questions like how natural the goal is in terms of the inductive biases of the training process, how much it corresponds to aspects of the model that we know how to look for, how easily it can be broken down into individually checkable pieces, etc.

As a concrete example of how precise a training goal should be, I’m going to compare two different ways in which Paul Christiano has described a type of model that he’d like to build.[2] First, consider how Paul describes corrigibility:

I would like to build AI systems which help me:

  • Figure out whether I built the right AI and correct any mistakes I made

  • Remain informed about the AI’s behavior and avoid unpleasant surprises

  • Make better decisions and clarify my preferences

  • Acquire resources and remain in effective control of them

  • Ensure that my AI systems continue to do all of these nice things

  • …and so on

We say an agent is corrigible (article on Arbital) if it has these properties.

In my opinion, a description like the above would do very poorly as a training goal specification. Though Paul’s description of corrigibility specifies a bunch of things that a corrigible model should do, it doesn’t describe them in a way that actually pins down how the model should do those things. Thus, if you try to just build a training rationale for how to get something like the above, I think you’re likely to just get stuck on what sort of model you could try to train that, in the broad space of possible models, would actually have those properties.

Now, compare Paul’s description of corrigibility above to Paul’s description of the “intended model” in “Teaching ML to answer questions honestly instead of predicting human answers:”

The intended model has two parts: (i) a model of the world (and inference algorithm), (ii) a translation between the world-model and natural language. The intended model answers questions by translating them into the internal world-model.

We want the intended model because we think it will generalize “well.” For example, if the world model is good enough to correctly predict that someone blackmails Alice tomorrow, then we hope that the intended model will tell us about the blackmail when we ask (or at least carry on a dialog from which we can make a reasonable judgment about whether Alice is being blackmailed, in cases where there is conceptual ambiguity about terms like “blackmail”).

We want to avoid models that generalize “badly,” e.g. where the model “knows” that Alice is being blackmailed yet answers questions in a way that conceals the blackmail.

Paul’s first paragraph here can clearly be interpreted as a training goal specification with the latter two paragraphs being training goal desirability—and in this case I think this is exactly what a training goal should look like. Paul describes a specific mechanism for how the intended model works—using an honest mapping from its internal world-model to natural language—and explains why such a model would work well and what might go wrong if you instead got something that didn’t quite match that description. In this case, I don’t think that Paul’s training goal specification above would actually work for training a competitive system—and Paul doesn’t intend it that way—but nevertheless, I think it’s a good example of what I think a mechanistic training goal should look like.

Looking forward, I’d like to be able to develop training goals that are even more specific and mechanistic than Paul’s “intended model.” Primarily, that’s because the more specific/​mechanistic we can get our training goals, the more room that we should eventually have for failure in our training rationales—if a training goal is very specific, then even if we miss it slightly, we should hopefully still end up in a safe part of the overall model space. Ideally, as I discuss later, I’d like to have rigorous sensitivity analyses of things like “if the training rationale is slightly wrong in this way, by how much do we miss the training goal”—but getting there is going to require both more specific/​mechanistic training goals as well as a much better understanding of when training rationales can fail. For now, though, I’d like to set the bar for “how mechanistic/​precise should a training goal specification be” to “at least as mechanistic/​precise as Paul’s description above.”

Relationship to inner alignment

The point of training stories is not to do away with concepts like mesa-optimization, inner alignment, or objective misgeneralization. Rather, the point of training stories is to provide a universal framework in which all of those sorts of concepts can live as discrete subproblems—specific ways in which a training story might go wrong.

Thus, here’s my training-stories-centric glossary of many of these other terms that you might encounter around AI safety:

  • Objective misgeneralization: Objective misgeneralization, otherwise called an objective robustness failure or capability generalization without objective generalization, refers to a situation in which the final model matches the desired capabilities of the training goal, but uses those capabilities in a different way or for a different purpose/​objective than the training goal.

    • For example: Suppose your training goal is a model that successfully solves mazes, but in training there’s always a green arrow at the end of each maze. Then, if you ended up with a model that had the capability to navigate mazes successfully but used that capability to go to the green arrow rather than to the end of the maze (even when the arrow was no longer at the end), that would be objective misgeneralization (see the sketch just after this glossary for a toy illustration). For a slightly more detailed explanation of this example, see “Towards an empirical investigation of inner alignment,” and for an empirical demonstration of it, see Koch et al.’s “Objective Robustness in Deep Reinforcement Learning.”

  • Mesa-optimization: Mesa-optimization refers to any situation in which the model you end up with is internally running some sort of optimization process. Particularly concerning is unintended mesa-optimization, which is a situation in which the model is an optimizer but the training goal didn’t include any sort of optimization.

  • Outer alignment: Outer alignment refers to the problem of finding a loss/​reward function such that the training goal of “a model that optimizes for that loss/​reward function” would be desirable.

  • Inner alignment: Inner alignment refers to the problem of constructing a training rationale that results in a model that optimizes for the loss/​reward function it was trained on.

  • Deceptive alignment: Deceptive alignment refers to the problem of constructing a training rationale that avoids models that are trying to fool the training process into thinking that they’re doing the right thing. For an exploration of how realistic such a problem might be, see Mark Xu’s “Does SGD Produce Deceptive Alignment?”
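To make the maze/green-arrow example from the objective misgeneralization entry above concrete, here's a toy sketch (purely illustrative; it is not the environment from the cited papers, and it abstracts away the actual maze-navigation capability): two policies that are behaviorally identical on the training distribution, where the arrow always marks the exit, but that pursue different objectives once the arrow moves.

```python
from dataclasses import dataclass

@dataclass
class Maze:
    exit_pos: tuple   # where the maze actually ends
    arrow_pos: tuple  # where the green arrow is drawn

# Two policies with the same navigation capability (abstracted away here as
# "go to a target square") but different objectives.
def go_to_exit(maze: Maze) -> tuple:
    return maze.exit_pos   # the intended objective

def go_to_arrow(maze: Maze) -> tuple:
    return maze.arrow_pos  # the misgeneralized objective

# Training distribution: the arrow is always at the exit, so the two policies
# are behaviorally indistinguishable and both get perfect reward.
train_maze = Maze(exit_pos=(9, 9), arrow_pos=(9, 9))
assert go_to_exit(train_maze) == go_to_arrow(train_maze)

# Deployment: the arrow moves, and only now do the two objectives come apart.
test_maze = Maze(exit_pos=(9, 9), arrow_pos=(0, 5))
assert go_to_exit(test_maze) != go_to_arrow(test_maze)
```

Nothing about the training distribution distinguishes between these two policies, which is why the distinction has to show up in the training goal specification and be backed by the training rationale, rather than by training behavior alone.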

It’s worth pointing out how phrasing inner and outer alignment in terms of training stories makes clear what I think was our biggest mistake in formulating that terminology, which is that inner/​outer alignment presumes that the right way to build an aligned model is to find an aligned loss function and then have a training goal of finding a model that optimizes for that loss function. However, as I hope the more general framework of training stories should make clear, there are many possible ways of trying to train an aligned model. Microscope AI and STEM AI are examples that I mentioned previously, but in general any approach that intends to use a loss function that would be problematic if directly optimized for, but then attempts to train a model that doesn’t directly optimize for that loss function, would fail on both outer and inner alignment—and yet might still result in an aligned model.

One of my hopes with training stories is that it will help us better think about approaches in the broader space that Microscope AI and STEM AI operate in, rather than just feeling constrained to approaches that fit nicely within the paradigm of inner alignment.

Do training stories capture all possible ways of addressing AI safety?

Though training stories are meant to be a very general framework—more general than outer/​inner alignment, for example—there are still approaches to AI safety that aren’t covered by training stories. For example:

  • Training stories can’t handle approaches to building advanced AI systems that don’t involve a training step, since having a notion of “training” is a fundamental part of the framework. Thus, a non-ML based approach using e.g. explicit hierarchical planning wouldn’t be able to be analyzed under training stories.

  • Training stories can’t handle approaches that aim to gain confidence in a model’s safety without gaining any knowledge of what, mechanistically, the model might be doing, since in such a situation you wouldn’t be able to formulate a training goal. Partly this is by design, as I think that having a clear training goal is a really important part of being able to build confidence in the safety of a training process. However, approaches that manage to give us a high degree of confidence in a model’s safety without giving us any insight into what that model is doing internally are possible and wouldn’t be able to be analyzed under training stories. It’s worth pointing out, however, that just because training stories require a training goal doesn’t mean that they require transparency and interpretability tools or any other specific way of trying to gain insight into what a model might be doing—so long as an approach has some story for what sort of model it wants to train and why that sort of model will be the one that it gets, training stories is perfectly applicable.[3]

  • Training stories can’t handle any approach which attempts to defuse AI existential risk without actually building safe, advanced AI systems. For example, a proposal for how to convince AI researchers not to build potentially dangerous AIs, though it might be a good way of mitigating AI existential risk, wouldn’t be a proposal that could possibly be analyzed using training stories.

Evaluating proposals for building safe advanced AI

Though I’ve described how I think training stories should be constructed—that is, using the four components I detailed previously—I haven’t explained how I think training stories should be evaluated.

Thus, I want to introduce the following four criteria for evaluating a training story to build safe advanced AI. These criteria are based on the criteria I used in “An overview of 11 proposals for building safe advanced AI,” but adapted to the training stories setting. Note that these criteria should only be used for proposals for advanced/transformative/general AI, not just any AI project. Though I think that the general training stories framework is applicable to any AI project, these specific evaluation criteria are only for proposals for building advanced AI systems.

  1. Training goal …

    1. … alignment: whether, if successfully achieved, the training goal would be good for the world—in other words, whether the training goal is aligned with humanity. If the training goal specification is insufficiently precise, then a proposal should fail on training goal alignment if there is any model that meets the training goal specification that would be bad for the world.

    2. … competitiveness: whether, if successfully achieved, the training goal would be powerful enough to compete with other AI systems. That is, a proposal should fail on training goal competitiveness if it would be easily outcompeted by other AI systems that might exist in the world.

  2. Training rationale …

    1. … alignment: whether the training rationale is likely to work in ensuring that the final model conforms to the training goal specification—in other words, whether the final model is aligned with the training goal. Evaluating training rationale alignment necessarily involves evaluating how likely the training rationale constraints and nudges are to successfully ensure that the training process produces a model that matches the training goal.

    2. … competitiveness: how hard the training rationale is to execute. That is, a proposal should fail on training rationale competitiveness if its training rationale is significantly more difficult to implement—e.g. because of compute or data requirements—than competing alternatives.

Case study: Microscope AI

In this section, I want to take a look at a particular concrete proposal for building safe advanced AI that I think is hard to evaluate properly without training stories, and show that, with training stories, we can easily make sense of what it’s trying to do and how it might or might not succeed.

That proposal is Chris Olah’s Microscope AI. Here’s my rendition of a training story for Microscope AI:

“The training goal of Microscope AI is a purely predictive model that internally makes use of human-understandable concepts to be able to predict the data given to it, without reasoning about the effects of its predictions on the world. Thus, we can think of Microscope AI’s training goal as having two key components:

  1. the model doesn’t try to optimize anything over the world, instead being composed solely of a world model and a pure predictor; and

  2. the model uses human-understandable concepts to do so.

The reason that we want such a model is so that we can do transparency and interpretability on it, which should hopefully allow us to extract the human-understandable concepts learned by the model. Then, the idea is that this will be useful because we can use those concepts to help improve human understanding and decision-making.

The plan for getting there is to do self-supervised learning on a large, diverse dataset while using transparency tools during training to check that the correct training goal is being learned. Primarily, the training rationale is to use the nudge of an inductive bias towards simplicity to ensure that we get the desired training goal. This relies on it being the case that the simplest algorithm that’s implementable on a large neural network and successfully predicts the training data is a straightforward/​pure predictor—and one that uses human-understandable concepts to do so. The use of transparency tools during training is then mostly just to verify that such a nudge is in fact sufficient, helping to catch the presence of any sort of agentic optimization so that training can be halted in such a case.”
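As a very rough sketch of the last part of that training rationale (nothing more than a sketch: a function like detects_agentic_optimization below is entirely hypothetical, and building anything like it reliably is an open problem), the key structural point is that the transparency check runs periodically during training and halts it, rather than only evaluating the final model:

```python
def detects_agentic_optimization(model) -> bool:
    """Hypothetical transparency check: returns True if interpretability
    tools find evidence of agentic optimization inside the model."""
    raise NotImplementedError  # no reliable version of this exists today

def train_microscope_ai(model, optimizer, loss_fn, data_loader, check_every=1000):
    """Self-supervised predictive training with periodic transparency checks,
    used only to verify the inductive-bias nudge, not as a training signal."""
    for step, (inputs, targets) in enumerate(data_loader):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)  # pure predictive loss
        loss.backward()
        optimizer.step()

        if step % check_every == 0 and detects_agentic_optimization(model):
            # The nudge toward a pure predictor appears to be failing; stop
            # training rather than continue producing a possibly unsafe model.
            raise RuntimeError("Transparency check flagged agentic optimization; halting training.")
```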

Now, we’ll try to evaluate Microscope AI’s training story using our four criteria from above:

  1. Training goal …

    1. … alignment: Training goal alignment for Microscope AI might seem trivial, since intuitively the training goal of a purely predictive model just shouldn’t be dangerous.

      However, there are potential safety issues even with purely predictive models—in particular, once a predictor starts predicting a world that involves itself, it runs into self-reference problems that might have multiple fixed points, some of which could be quite bad. For example: a pure predictor might predict that the world will be destroyed and replaced by a new, very easy-to-predict world in such a way that causes precisely that to happen. Exactly that scenario would likely require the predictor to be choosing its predictions to optimize the world to be easy to predict, which might be ruled out by the training goal (depending on exactly how it’s specified), but the general problem of how a predictor should handle self-fulfilling prophecies remains regardless. Though the training goal that I gave previously enforces that the model not be “reasoning about the effects of its predictions on the world,” exactly how to do that, given that its predictions are in fact a part of the world, is non-trivial. For more detail on this sort of scenario, see Abram Demski’s “The Parable of Predict-O-Matic.”

      Fortunately for Microscope AI, however, such a scenario should hopefully be off the table, as the goal isn’t to ever actually use the model’s predictions, but rather just to extract the concepts the model is using to make its predictions—and since the model is just trained on prediction, it shouldn’t ever have to predict anything that depends on its own predictions (though this depends on the variant of self-supervised training being used). However, it’s unclear whether that’s enough to fully address such concerns—the model’s internal concepts can be thought of as a type of prediction/​output, and if the model is actively selecting those concepts to achieve a particular result, as above, that could be quite bad. Thus, for Microscope AI to succeed on training goal alignment, the training goal specification likely needs to be quite firm in ruling out all possible such optimization over the world.

    2. … competitiveness: There are a couple of major factors to address regarding training goal competitiveness for Microscope AI.

      First is whether interpreting a model that meets the training goal would actually meaningfully enhance human understanding. For that to be the case, transparency and interpretability tools would have to be capable of extracting useful knowledge that humans can understand but currently don’t. For example, if transparency and interpretability tools were limited to just determining whether some piece of information exists in a model, Microscope AI would be useless, as only information that humans already knew to look for could be extracted. Furthermore, the information extracted has to actually be information that’s useful to humans in making the sorts of decisions that humans want to make effectively—which could be especially difficult given that the information in the model is only ever optimized for prediction, not for decision-making.

      Second, for Microscope AI to be training goal competitive, enhanced human understanding alone—without the ability to build any agentic systems—has to be sufficient for the economic use cases where one might otherwise want highly agentic advanced AI (e.g. an AGI). From “An overview of 11 proposals:”

      This question is likely to depend heavily on what exactly those use cases are. Like with amplification, if you need lots of fine motor control, microscope AI is unlikely to get you there. Furthermore, unlike amplification, if you need lots of low-level decision-making where it’s too expensive to hire a human, microscope AI won’t help much there either (whereas amplification would be fine). Potentially microscope AI could give humans the knowledge to safely build other systems which could solve such tasks, however. Furthermore, if the primary use case for AGI is just high-level big-picture decision-making (automating CEOs or doing AI research, for example), then it seems likely that microscope AI would have a real shot of being able to address those use cases. In that sort of a situation—where you’re only trying to make a small number of high-quality decisions—it seems likely to be fairly cheap to have a human in the loop and thus simply improving that human’s knowledge and understanding via microscope AI might be sufficient to produce competitive decision-making. This is especially true if there is a market premium on having a human making the decisions, perhaps because that makes it easier to negotiate or work with other humans.

  2. Training rationale …

    1. … alignment: In discussing training goal alignment, we established that the training goal needs to quite firmly rule out any optimization over the world. This creates a serious challenge for the training rationale, however, as enforcing the absence of optimization over the world could be quite difficult.

      In particular, predicting a world that involves optimization might require optimization, potentially pushing any predictive model towards becoming an agent. Specifically, any world model which needs to be able to answer queries of the form “what is the probability that action a is optimal according to utility function u” or “what is the best action for agent A to take in state s” would likely have to implicitly be running some sort of an optimization algorithm. Given that the world does in fact contain lots of optimizers—humans, for example—being able to handle such queries seems likely to be necessary to be able to efficiently predict the world. Thus, enforcing something like “pure prediction” could be quite difficult to do while also maintaining accuracy on questions involving humans. Furthermore, even if such enforcement could be done, it seems like it would put a very large burden on the transparency tools being used to do that enforcement, as the training process itself would likely be fighting against the enforcement mechanism—since just using optimization would still be a good way for the training process to modify the model to predict the data well. This could be especially concerning if it allows the model to start performing optimization and become deceptive before that optimization can be detected.

      Furthermore, the training goal also requires that the model not just be a pure predictor, but also be using human-level concepts to do its prediction. While it might seem like this would just be falsified in the same way as the cat detection training story, Chris Olah argues that more powerful models should actually be more likely to use human-level concepts, at least up until those models get human-level capabilities. Under such a worldview, we only see models learning non-human abstractions because we are currently operating in a “valley of confused abstractions,” but once we get human-level models, they’ll learn human-level concepts. Worryingly, however, Chris also predicts that, as our models then surpass human-level, we’ll start to get “increasingly alien abstractions,” which could again cause problems for Microscope AI.

    2. … competitiveness: Training rationale competitiveness is one area where Microscope AI generally does pretty well, as self-supervised learning is something we already know how to do and do efficiently. The biggest potential training rationale competitiveness issue, however, would be if the use of transparency tools during training to enforce the training goal—e.g. to check for optimization—significantly slowed down the training process or were otherwise too expensive. For example—if it’s necessary for humans to use transparency tools to fully reevaluate the model at each training step, that could end up being pretty uncompetitive. As such, it seems likely that we’ll need at least some progress in automated transparency to make Microscope AI’s training rationale competitive.

Compared to my previous analysis of Microscope AI, I think that this version is much clearer, easier to evaluate, and better at surfacing concrete open problems. For example, rather than my previous outer alignment analysis that simply stated that Microscope AI wasn’t outer aligned and wasn’t trying to be, we now have a very clear idea of what it is trying to be and an evaluation of that specific goal.

Exploring the landscape of possible training stories

Though I like the above Microscope AI example for showcasing one particular training story for building safe advanced AI and how it can be evaluated, I also want to spend some time looking into the broader space of all possible training stories. To do that, I want to look at some of the broad classes that training goals and training stories can fall into other than the ones that we just saw with Microscope AI. By no means should anything here be considered a complete list, however—in fact, my sense is that we’re currently only scratching the surface of all possible types of training goals and plans.

We’ll start with some possible broad classes of training goals.

  • Loss-minimizing models: Though of course all models are selected to minimize loss, they won’t necessarily have some internal notion of what the loss is and be optimizing for that—but a model that is actually attempting to minimize its loss signal is a possible training goal that you might have. Unfortunately, having a loss-minimizing model as your training goal could be a problem—for example, such a model might try to wirehead or otherwise corrupt the loss signal. That being said, if you’re confident enough in your loss signal that you want it to be directly optimized for, a loss-minimizing model might still be a reasonable training goal to aim for. However, getting a loss-minimizing model could be quite difficult, as “the loss signal” is not generally a very natural concept in most training environments—for example, if you train a model on the loss function of “going to as many red doors as possible,” you should probably expect it to learn to care about red doors rather than to care about the floating point number in the training process encoding the loss signal about red doors.

  • Fully aligned agents: Conceptually, a fully aligned agent is an agent that cares about everything that we care about and acts in the world to achieve those goals. Perhaps the most concrete proposal with such an agent as the training goal is ambitious value learning, where the idea is to learn a full model of what humans care about and then an agent that optimizes for that. Most proposals for building advanced AI systems have moved away from such a training goal, however, for good reason—it’s a very difficult goal to achieve.

  • Corrigible agents: When I previously quoted Paul’s definition of corrigibility, I said it wasn’t mechanistic enough to serve as a training goal. However, it certainly counts as a broad class of possible training goals. Perhaps the clearest example of a corrigible training goal would be Paul Christiano’s concept of an approval-directed agent, an agent that is exclusively selecting each of its actions to maximize human approval—though note that there are some potential issues with the concept of approval-direction actually leading to corrigibility once translated into the sort of mechanistic/algorithmic description necessary for a training goal specification.

  • Myopic agents: A myopic agent is an agent that isn’t optimizing any sort of coherent long-term goal at all—rather, myopic agents have goals that are limited in some sort of discrete way. Thus, in addition to being an example of a corrigible training goal, an approval-directed agent would also be a type of myopic training goal, as an approval-directed agent only optimizes over its next action, not any sort of long-term goal about the world. Paul refers to such agents that only optimize over their next action as act-based agents, making act-based agents a subset of myopic agents. Another example of a myopic training goal that isn’t act-based would be an LCDT agent, which exclusively optimizes its objective without going through any causal paths involving other agents.

  • Simulators: A model is a simulator if it’s exclusively simulating some other process. For example, a training goal for imitative amplification might be a model that simulates HCH. Alternatively, you could have a training goal of a physics simulator if you were working on something like AlphaFold, or a goal of having your GPT-style language model simulate human internet users. One important point to note about simulators as a training goal, however, is that it’s unclear how a pure simulator is supposed to manage its computational resources effectively to best simulate its target—e.g. how does a simulator choose what aspects of the simulation target are most important to get right? A simulator which is able to manage its resources effectively in such a way might just need to be some sort of an agent, though potentially a myopic agent—and in fact being able to act as such a simulator is the explicit goal of LCDT.

  • Narrow agents: I tend to think of a narrow agent as an agent that has a high degree of capability in a very specific domain, without having effectively any capability in other domains, perhaps never even thinking about/​considering/​conceptualizing other domains at all. An example of a proposal with a narrow agent as its training goal would be STEM AI, which aims to build a model that exclusively understands specific scientific/​technical/​mathematical problems without any broader understanding of the world. In that sense, narrow agents could also be another way of aiming for a sort of simulator that’s nevertheless able to manage its computational resources effectively by performing optimization only in the narrow domain that they understand.

  • Truthful question-answerers: In “Teaching ML to answer questions honestly instead of predicting human answers,” as I quoted previously, Paul Christiano describes the training goal as a model with “two parts: (i) a model of the world (and inference algorithm), (ii) a translation between the world-model and natural language. The intended model answers questions by translating them into the internal world-model.” What Paul is describing here isn’t an agent at all—rather, it’s purely a truthful question-answering system that accurately reports what its model of the world says/​predicts in human-understandable terms.

All of the above ideas are exclusively training goals, however—for any of them to be made into a full training story, they’d need to be combined with some specific training rationale for how to achieve them. Thus, I also want to explore what some possible classes of training rationales might look like. Remember that a training rationale isn’t just a description of what will be done to train the model—so you won’t see anything like “do RL” or even “do recursive reward modeling” on this list—rather, a training rationale is a story for how/​why some approach like that will actually succeed.

  • Capability limitations: One somewhat obvious training rationale—but one that I think is nevertheless worth calling attention to, as it can often be quite useful—is analyzing whether a model would actually have the capabilities to do any sort of bad/undesirable thing. For example: many current systems may simply not have the model capacity to learn the sorts of algorithms—e.g. optimization algorithms—that might be dangerous. To make these sorts of training rationales maximally concrete and falsifiable, I think a good way to formulate a training rationale of this form is to isolate a particular sort of capability that is believed to be necessary for a particular type of undesirable behavior and combine that with whatever evidence there is for why a model produced by the given training process wouldn’t have that capability. For example: if the ability to understand how to deceive humans is a necessary capability for deception, then determining that such a capability would be absent could serve as a good training rationale for why deception wouldn’t occur. Unfortunately, current large language models seem to be capable of understanding how to deceive humans, making that specific example insufficient.

  • Inductive bias analysis: Inductive bias analysis is the approach of attempting to carefully understand the inductive biases of a training process enough to be able to predict what sort of model will be learned. For example, any approach which attempts to predict what the “simplest” model will be given some training procedure and dataset is relying on inductive bias analysis—as in both the cat detection and Microscope AI training stories that we’ve seen previously.

    Inductive bias analysis is a very tempting approach, as it allows us to essentially just do standard machine learning and have a good idea of what sort of model it’ll produce. Unfortunately, once you start being very careful about your inductive bias analysis and working everything out mathematically—as in “Answering questions honestly instead of predicting human answers: lots of problems and some solutions”—it starts to get very tricky and very difficult to do successfully. This is especially problematic given how inductive bias analysis essentially requires getting everything right before training begins, as a purely inductive-bias-analysis-based training rationale doesn’t provide any mechanism for verifying that the right training goal is actually being learned during training.

    Hopefully, however, more results like deep double descent, lottery tickets, scaling laws, grokking, or distributional generalization will help us build better theories of neural network inductive biases and thus become more confident in any inductive-bias-analysis-based training stories.

  • Transparency and interpretability: As we saw in Microscope AI’s use of transparency tools to check for unwanted optimization/​agency, the use of transparency tools during training can be a very useful component of a training rationale, helping to verify that the right sort of algorithm is being learned. Though the training story I gave above for Microscope AI stated that it was primarily relying on inductive bias analysis, an approach that primarily relies on transparency tools would also be a possibility. Even then, however, some inductive bias analysis would likely still be necessary—e.g. “We think that our transparency checks will rule out all simple models that don’t fit the training goal, with all remaining models that don’t fit the goal being too complex according to the inductive biases of the training process to possibly be learned.”

    It’s worth noting, however, that all of the above uses of transparency tools rely on worst-case transparency—that is, the ability to actively check for a particular problem anywhere in a model rather than just the ability to understand some particular part of a model—which is something that transparency and interpretability currently still struggles with. Nevertheless, I think that transparency-and-interpretability-based training rationales are some of the most exciting, as unlike inductive bias analysis, they actually provide feedback during training, potentially letting us see problems as they arise rather than having to get everything right in advance.

  • Automated oversight: One way to significantly enhance the utility of transparency and interpretability tools is to not purely rely on humans being the ones deploying them—both because humans are slow and expensive, but also because humans are only capable of understanding human-level concepts. Thus, if you expect models to use concepts that are complex, alien, or otherwise difficult for humans to understand—as in the “increasingly alien abstractions” part of Chris Olah’s graph of interpretability vs. model strength—then using models that understand those concepts to do the interpretability work could potentially be a good way to ensure that interpretability continues working in such a regime.

    Of course, this raises the issue of how to ensure that the models doing the interpretability/​oversight are themselves safe. One solution to this problem is to use a form of recursive oversight, in which the overseer model and the model being overseen are the same model, variants of the same model, or otherwise recursively overseeing each other. For a more thorough exploration of what such an approach might look like, see “Relaxed adversarial training for inner alignment.”

  • AI cognitive science: In addition to the “neuroscience approach” of using transparency and interpretability to understand what our models are doing—since transparency is about looking inside models’ “brains”—there is also the “cognitive science” approach of proposing theories about what models are doing internally and then testing them via behavioral experiment. An example of this sort of approach would be Deletang et al.’s “Causal Analysis of Agent Behavior for AI Safety,” wherein the authors construct causal models of what agents might be doing and then test them through causal interventions.

    One problem with this style of approach, however, is that it gives us much less direct information about what algorithm a model might be implementing, as it still relies on purely behavioral information about what the model appears to do, rather than structural information about what the model is doing internally that transparency approaches could hopefully produce. Thus, training rationales based on AI cognitive science approaches might have to rely on some degree of extrapolation from experiments on other, similar models—extrapolation that could have difficulty predicting new problems that only arise with larger/​more powerful systems, which is a potential issue for any training rationale primarily based on this sort of an approach.

  • Precursor checking: Another general type of training rationale that I think is worth calling attention to is what I’ll call “precursor checking,” which is the concept of using some method of gaining information about a model’s internals—e.g. transparency/interpretability or AI cognitive science—to check for some precursor to bad behavior rather than the bad behavior itself. This could involve substituting in some narrower, easier-to-check training goal—that still falls within the broader actual training goal—as the target for the training rationale. For example, if your training rationale involves ensuring that you don’t get a deceptive model that’s actively trying to trick its training process, then rather than explicitly trying to look for such deception (which could be especially hard since a deceptive model might actively try to avoid detection), you could instead try to ensure that your model has a short horizon length in terms of how far ahead it’s planning. Such a plan might work better, since horizon length might be easier to guarantee in a training rationale while still being consistent with the desired training goal and hopefully ruling out the possibility of deception.[4] One issue with this sort of approach, however, is that you have to guarantee that whatever precursor for bad behavior you’re looking for is in fact a necessary condition for that bad behavior—if it turns out that there’s another way of getting that bad behavior that doesn’t go through the precursor, that could be a problem.

  • Loss landscape analysis: I think of loss landscape analysis as an extension of inductive bias analysis that focuses on the path-dependence of the training process. For example: if you can identify large barriers in the loss landscape, you can potentially use that to narrow down the space of possible trajectories through model space that a training process might take and thus the sorts of models that it might produce. Loss landscape analysis could be especially useful if used in conjunction with precursor checking, since compared to pure inductive bias analysis, loss landscape analysis could help you say more things about what precursors will be learned, not just what final equilibria will be learned. Loss landscape analysis could even be combined with transparency tools or automated oversight to help you artificially create barriers in the loss landscape based on what the overseer/transparency tools are detecting in the model at various points in training (see the sketch just after this list for one simple form this combination could take).

  • Game-theoretic/​evolutionary analysis: In the context of a multi-agent training setup, another type of training rationale could be to understand what sorts of models a training process might produce by looking at the game-theoretic equilibria/​incentives of the multi-agent setting. One tricky thing with this style of approach, however, is avoiding the assumption that the agents would actually be acting to optimize their given reward functions, since such an assumption is implicitly assuming that you get the training goal of a loss-minimizing model. Instead, such an analysis would need to focus on what sorts of algorithms would tend to be selected for by the emergent multi-agent dynamics in such an environment—a type of analysis that’s perhaps most similar to the sort of analysis done by evolutionary biologists to understand why evolution ends up selecting for particular organisms, suggesting that such evolutionary analysis might be quite useful here. For a more detailed exploration of what a training rationale in this sort of a context might look like, see Richard Ngo’s “Shaping safer goals.”
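As a sketch of one simple form the transparency-tools-plus-loss-landscape idea above could take (again purely hypothetical: it assumes a differentiable detector score from transparency tools, which is a large assumption), the detector's output is folded into the loss so that regions of model space it flags become high-loss barriers that training trajectories are pushed away from, rather than merely being grounds for halting after the fact:

```python
def optimization_detector_score(model):
    """Hypothetical: a differentiable scalar from transparency/oversight tools
    that is high when the model shows signs of unwanted mesa-optimization."""
    raise NotImplementedError

def shaped_loss(model, inputs, targets, loss_fn, barrier_weight=10.0):
    # Ordinary task loss...
    task_loss = loss_fn(model(inputs), targets)
    # ...plus an artificial barrier in the loss landscape: model-space regions
    # flagged by the (hypothetical) detector become much higher-loss, nudging
    # the training trajectory away from them.
    return task_loss + barrier_weight * optimization_detector_score(model)
```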

Given such a classification of training rationales, we can label various different AI safety approaches based on what sort of training goal they have in mind and what sort of training rationale they want to use to ensure that they get there. For example, Paul Christiano’s “Teaching ML to answer questions honestly instead of predicting human answers,” that I quoted from previously, can very straightforwardly be thought of as an exercise in using inductive bias analysis to ensure a truthful question-answerer.

Additionally, more than just presenting a list of possible training goals and training rationales, I hope that these lists help open up the space of possible strategies for building safe advanced AI beyond those that have previously been proposed. This includes both novel ways of combining a training goal with a training rationale—e.g. what if you used inductive bias analysis to get a myopic agent or AI cognitive science to get a narrow agent?—and the broader space of possible training goals and rationales, which likely includes many more possibilities that we’ve yet to consider.

Training story sensitivity analysis

If we do start using training stories regularly for reasoning about AI projects, we’re going to have to grapple with what happens when training stories fail—because, as we’ve already seen with e.g. the cat detection training story from earlier, seemingly plausible training stories can and will fail. Ideally, we’d like it to always be the case that training stories fail safely: especially when it comes to particularly risky failure modes such as deceptive alignment, rather than risk getting a deceptive model, we’d much rather training just not work. Furthermore, if always failing safely is too difficult, we’ll need to have good guarantees regarding the degree to which a training story can fail and in what areas failure is most likely.

In all of these cases, I want to refer to this sort of work as training story sensitivity analysis. Sensitivity analysis in general is the study of how the uncertainty in the inputs to something affects its outputs. In the case of training stories, that means answering questions like “how sensitive is this training rationale to changes in its assumptions about the inductive biases of neural networks?” and “in the situations where the training story fails, how likely is it to fail safely vs. catastrophically?” There are lots of ways to start answering questions like this, but here are some examples of the sorts of ways in which we might be able to do training story sensitivity analysis:

  • If we are confident that some particular dangerous behavior requires some knowledge, capability, or other condition that we are confident that the model doesn’t have, then even if our training story fails, it shouldn’t fail in that particular dangerous way.

  • If we can analyze how other, similar, smaller, less powerful models have failed, we can try to extrapolate those failures to larger models to predict the most likely ways in which we’ll see training stories fail—especially if we aggressively red-team those other models first to look for all possible failure modes.

  • If we can get a good sense of the space of all possible low-loss models that might be learned by a particular training process, and determine which ones wouldn’t fit the training goal, we can get a good sense of some of the most likely sorts of incorrect models that our training process might learn.

  • If we can analyze what various different paths through model space a training process might take, we can look at what various perturbations of the desired path might look like, what other equilibria such a path might fall into, and what other paths might exist that would superficially look the same.

Hopefully, as we build better training stories, we’ll also be able to build better tools for their sensitivity analysis so we can actually build real confidence in what sort of model our training processes will produce.


  1. ↩︎

    It’s worth noting that there are ways to potentially build advanced or transformative AI that don’t assume the emergence of agency (and in fact might rely on the opposite) such as the aforementioned Microscope AI or STEM AI.

  2. ↩︎

    Obviously this isn’t entirely fair, since in neither of these cases was Paul trying to write a training goal; nevertheless, I think that the second quote is a really good example of what a training goal should look like.

  3. ↩︎

    For example, instead of using transparency and interpretability tools, you might instead try to make use of AI cognitive science, as I discuss in the final section on “Exploring the landscape of possible training stories.”

  4. ↩︎

    It’s worth noting that while guaranteeing a short horizon length might be quite helpful for preventing deception, a short horizon length alone isn’t necessarily enough to guarantee the absence of deception, since e.g. a model with a short horizon length might cooperate with future versions of itself in such a way that looks more like a model with a long horizon length. See “Open Problems with Myopia” for more detail here.